Exploratory Data Analysis

More articles: Estadistica4all

Author: Fabio Scielzo Ortiz

If you use this article, please reference it:

Scielzo Ortiz, Fabio. (2023). Exploratory Data Analysis. http://estadistica4all.com/Articulos/EDA.html



1 Exploratory Data Analysis (EDA)

Exploratory data analysis (EDA) refers to the descriptive statistical analysis of a data-set.

Next, we propose a methodology for carrying out an EDA, using Python as the programming language.

2 Data Pre-processing

2.1 Import data-set

First of all, we import the data-set with which we will work.

import pandas as pd

Netflix_Data = pd.read_csv('titles.csv')
Netflix_Data
id title type description release_year age_certification runtime genres production_countries seasons imdb_id imdb_score imdb_votes tmdb_popularity tmdb_score
0 ts300399 Five Came Back: The Reference Films SHOW This collection includes 12 World War II-era p… 1945.0 TV-MA 51 [‘documentation’] [‘US’] 1.0 NaN NaN NaN 0.600 NaN
1 tm84618 Taxi Driver MOVIE A mentally unstable Vietnam War veteran works … 1976.0 R 114 [‘drama’, ‘crime’] [‘US’] NaN tt0075314 8.2 808582.0 40.965 8.179
2 tm154986 Deliverance MOVIE Intent on seeing the Cahulawassee River before… 1972.0 R 109 [‘drama’, ‘action’, ‘thriller’, ‘european’] [‘US’] NaN tt0068473 7.7 107673.0 10.010 7.300
3 tm127384 Monty Python and the Holy Grail MOVIE King Arthur, accompanied by his squire, recrui… 1975.0 PG 91 [‘fantasy’, ‘action’, ‘comedy’] [‘GB’] NaN tt0071853 8.2 534486.0 15.461 7.811
4 tm120801 The Dirty Dozen MOVIE 12 American military prisoners in World War II… 1967.0 NaN 150 [‘war’, ‘action’] [‘GB’, ‘US’] NaN tt0061578 7.7 72662.0 20.398 7.600
5845 tm1014599 Fine Wine MOVIE A beautiful love story that can happen between… 2021.0 NaN 100 [‘romance’, ‘drama’] [‘NG’] NaN tt13857480 6.8 45.0 1.466 NaN
5846 tm898842 C/O Kaadhal MOVIE A heart warming film that explores the concept… 2021.0 NaN 134 [‘drama’] [] NaN tt11803618 7.7 348.0 NaN NaN
5847 tm1059008 Lokillo MOVIE A controversial TV host and comedian who has b… 2021.0 NaN 90 [‘comedy’] [‘CO’] NaN tt14585902 3.8 68.0 26.005 6.300
5848 tm1035612 Dad Stop Embarrassing Me - The Afterparty MOVIE Jamie Foxx, David Alan Grier and more from the… 2021.0 PG-13 37 [] [‘US’] NaN NaN NaN NaN 1.296 10.000
5849 ts271048 Mighty Little Bheem: Kite Festival SHOW With winter behind them, Bheem and his townspe… 2021.0 NaN 7 [‘family’, ‘animation’, ‘comedy’] [] 1.0 tt13711094 7.8 18.0 2.289 10.000

5850 rows × 15 columns


2.2 Data-set conceptual description

This data-set has information about 15 variables on 5850 Netflix titles.

The following table gives a brief conceptual description of the data-set variables:

Variable Description Type
id The title ID on JustWatch Identifier
title The name of the title Text
type TV show or movie Categorical
description A brief description Text
release_year release year Quantitative
age_certification age rating Categorical
runtime Length in minutes of an episode (show) or of the movie Quantitative
genres A list of genres Categorical
production_countries A list of countries that produced the title Categorical
seasons Number of seasons if it’s a SHOW Quantitative
imdb_id The title ID on IMDB Identifier
imdb_score Rating on IMDB Quantitative
imdb_votes number of votes on IMDB Quantitative
tmdb_popularity Popularity on TMDB Quantitative
tmdb_score Rating on TMDB Quantitative


2.3 Data-set size

We can get the data-set size as the number of rows and columns of the data-set.

Netflix_Data.shape
(5850, 15)

As discussed above, the data-set has 5850 rows and 15 columns.


2.4 info() method

The info() method gives us the column names, the number of non-null values in each column and the type of each column.

Netflix_Data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5850 entries, 0 to 5849
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    5850 non-null   object 
 1   title                 5849 non-null   object 
 2   type                  5850 non-null   object 
 3   description           5832 non-null   object 
 4   release_year          5850 non-null   int64  
 5   age_certification     3231 non-null   object 
 6   runtime               5850 non-null   int64  
 7   genres                5850 non-null   object 
 8   production_countries  5850 non-null   object 
 9   seasons               2106 non-null   float64
 10  imdb_id               5447 non-null   object 
 11  imdb_score            5368 non-null   float64
 12  imdb_votes            5352 non-null   float64
 13  tmdb_popularity       5759 non-null   float64
 14  tmdb_score            5539 non-null   float64
dtypes: float64(5), int64(2), object(8)
memory usage: 685.7+ KB


2.5 Column types

There is another way to get column types.

Netflix_Data.dtypes
id                       object
title                    object
type                     object
description              object
release_year              int64
age_certification        object
runtime                   int64
genres                   object
production_countries     object
seasons                 float64
imdb_id                  object
imdb_score              float64
imdb_votes              float64
tmdb_popularity         float64
tmdb_score              float64
dtype: object

object is the typical type of categorical, identifier and text variables.

float64 and int64 are the typical types of quantitative variables: float64 for continuous ones and int64 for discrete ones.
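As a small illustration of this distinction, the sketch below splits the columns of a data frame by storage type with select_dtypes(). The toy data frame is hypothetical and only mimics a few columns of the Netflix data-set. The storage type is just a hint: a numeric column may still be an identifier or a coded category.

```python
import pandas as pd

# Hypothetical toy data frame that mimics a few columns of the Netflix data-set.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "type": ["SHOW", "MOVIE", "MOVIE"],
    "runtime": [51, 114, 109],
    "imdb_score": [None, 8.2, 7.7],
})

# Columns stored as numbers (int64 / float64): candidate quantitative variables.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Remaining columns: candidate categorical, identifier or text variables.
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```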


2.6 Change column types

We can change the type of a column with astype() method:

Netflix_Data['release_year'] = Netflix_Data['release_year'].astype('float')

We can check if the changes have been done correctly:

Netflix_Data.dtypes
id                       object
title                    object
type                     object
description              object
release_year            float64
age_certification        object
runtime                   int64
genres                   object
production_countries     object
seasons                 float64
imdb_id                  object
imdb_score              float64
imdb_votes              float64
tmdb_popularity         float64
tmdb_score              float64
dtype: object


2.7 Unique values of a variable

We can get the unique values of a variable with the unique() method.

We can get the unique values of type as follows:

Netflix_Data['type'].unique()
array(['SHOW', 'MOVIE'], dtype=object)

We can get the unique values of age_certification as follows:

Netflix_Data['age_certification'].unique()
array(['TV-MA', 'R', 'PG', nan, 'TV-14', 'PG-13', 'TV-PG', 'TV-Y', 'TV-G', 'TV-Y7', 'G', 'NC-17'], dtype=object)

We can get the unique values of production_countries as follows:

Netflix_Data['production_countries'].unique()     
array(["['US']", "['GB']", "['GB', 'US']", "['EG']", "['DE']", "['IN']",
       "['SU', 'IN']", "['LB', 'CA', 'FR']", '[]', "['LB']",
       "['DZ', 'EG']", "['CA', 'FR', 'LB']", "['US', 'GB']",
       "['US', 'IT']", "['JP']", "['AR']", "['FR', 'EG']", "['FR', 'LB']",
       "['CA', 'US']", "['US', 'FR']", "['JP', 'US']", "['US', 'CA']",
       "['DE', 'US']", "['PE', 'US', 'BR']", "['IT', 'US', 'FR']",
       "['IE', 'GB', 'DE', 'FR']", "['HK', 'US']", "['AU']", "['FR']",
       "['DE', 'GH', 'GB', 'US', 'BF']", "['MX']", "['ES', 'AR']",
       "['CO']", "['PS', 'US', 'FR', 'DE']", "['FR', 'NO', 'LB', 'BE']",
       "['BE', 'FR', 'IT', 'LB']", "['TR']", "['IN', 'SU']", "['DK']",
       "['CA']", "['DE', 'GB', 'US', 'BS', 'CZ']", "['MT', 'GB', 'US']",
       "['AU', 'DE', 'GB', 'US']", "['US', 'JP']", "['BE', 'US']",
       "['HK']", "['IT']", "['US', 'FR', 'DE', 'GB']",
       "['GB', 'US', 'FR', 'DE']", "['IT', 'US']", "['US', 'ZA']",
       "['GB', 'ES']", "['GB', 'US', 'JP']", "['HK', 'CN']",
       "['GB', 'US', 'BG']", "['RU']", "['KR']", "['CA', 'US', 'IN']",
       "['CN']", "['JP', 'HK']", "['CA', 'GB', 'US']",
       "['FR', 'MX', 'ES']", "['IN', 'US']", "['AR', 'ES']", "['CL']",
       "['FR', 'MA', 'DE', 'PS']", "['AR', 'DE', 'UY', 'ES']",
       "['CL', 'AR']", "['CZ', 'GB', 'DK', 'NL', 'SE']", "['TW']",
       "['SG']", "['NG']", "['MY']", "['Lebanon']",
       "['BE', 'FR', 'ES', 'CH', 'PS']", "['ZA']", "['NG', 'US']",
       "['LB', 'FR']", "['CN', 'HK']", "['PH']", "['LB', 'GB', 'FR']",
       "['FR', 'DE', 'KW', 'PS']", "['PS']",
       "['GB', 'US', 'AT', 'FR', 'DE', 'NG']", "['XX']", "['AE', 'US']",
       "['DK', 'US']", "['FR', 'US', 'GB']", "['HU', 'US', 'CA']",
       "['NO']", "['GB', 'FR', 'DE']", "['US', 'HU', 'IT']",
       "['US', 'ZA', 'DE']", "['IN', 'DE']", "['SA']", "['ID']",
       "['US', 'LB', 'AE']", "['PS', 'NL', 'US', 'AE']",
       "['US', 'FR', 'GB']", "['US', 'DE', 'GB']", "['GB', 'ZA']",
       "['US', 'CA', 'CL']", "['US', 'GB', 'CN', 'CA']",
       "['AU', 'CH', 'GB']", "['ES']", "['FI']", "['IL']", "['FR', 'US']",
       "['AU', 'US']", "['CA', 'US', 'GB']", "['AT']", "['CD', 'GB']",
       "['US', 'BR']", "['CA', 'JP', 'US']", "['CA', 'KR']",
       "['US', 'EG', 'GB']", "['BR']", "['PL']", "['VE', 'AR']", "['RO']",
       "['IL', 'NO', 'ZA', 'AE', 'GB', 'IS', 'IE']",
       "['US', 'CN', 'DE', 'SG', 'UA']", "['DE', 'IT', 'PS', 'FR']",
       "['AE', 'LB']", "['LB', 'AE']", "['US', 'ES']", "['NZ']",
       "['GB', 'US', 'FR']", "['US', 'FR', 'LU', 'GB']", "['FR', 'BE']",
       "['IT', 'GB']", "['US', 'CA', 'GB']", "['CA', 'FR']",
       "['US', 'CN']", "['UA']", "['MX', 'ZA', 'US']",
       "['US', 'GB', 'ES']", "['BE', 'DK', 'DE', 'GB', 'US']",
       "['GB', 'IR', 'JO', 'QA']", "['CH', 'US']", "['CA', 'DE', 'GB']",
       "['GH', 'US']", "['IE', 'GB']", "['CN', 'US']",
       "['UA', 'GB', 'US']", "['IE', 'ZA']", "['US', 'FR', 'MT']",
       "['BG']", "['GB', 'FR']", "['BY']", "['IE']", "['IS']",
       "['AU', 'FR', 'DE']", "['CN', 'FR', 'CA']", "['FR', 'QA']",
       "['SE']", "['FR', 'ES']", "['NL']", "['HR']", "['FR', 'MA']",
       "['RU', 'US', 'FR']", "['SY', 'GB']", "['AT', 'US']", "['CD']",
       "['FR', 'CL']", "['AU', 'GB']", "['TN']", "['AE']", "['SE', 'NO']",
       "['GL', 'FR']", "['LB', 'DE']", "['PT', 'SE', 'DK', 'BR', 'FR']",
       "['QA', 'LB']", "['GB', 'AU', 'US']", "['ES', 'DK']",
       "['AE', 'FR', 'JO', 'LB', 'QA', 'PS']", "['US', 'CA', 'JP']",
       "['PK']", "['IN', 'GB']", "['PS', 'FR', 'DE']", "['CZ']",
       "['CA', 'NG']", "['VN']", "['NL', 'GB']",
       "['CA', 'HU', 'MX', 'ES', 'GB', 'US']", "['FR', 'GB', 'US']",
       "['FR', 'NL', 'GB', 'US']", "['CN', 'CA', 'US']", "['CA', 'GB']",
       "['KR', 'US']", "['FR', 'RO', 'GB', 'BE', 'DE']", "['US', 'MX']",
       "['HK', 'IS', 'US']", "['IN', 'CN', 'US', 'GB']", "['BE', 'FR']",
       "['PR', 'US', 'GB', 'CN']", "['GB', 'DE']", "['US', 'PR']",
       "['IT', 'CH', 'FR']", "['IT', 'ES', 'FR']", "['US', 'IS', 'NO']",
       "['IQ', 'GB']", "['HU']", "['US', 'AU', 'GB']",
       "['CZ', 'GB', 'US']", "['US', 'IE', 'CA']", "['TH']",
       "['IR', 'US', 'FR']", "['BE']",
       "['GB', 'ID', 'CA', 'CN', 'SG', 'US']", "['ES', 'FR']",
       "['SG', 'GB', 'US']", "['GE', 'DE', 'FR']", "['CA', 'US', 'DE']",
       "['CA', 'IE']", "['NL', 'BE']", "['US', 'KH']", "['FR', 'JP']",
       "['PR']", "['US', 'CA', 'CN']", "['CN', 'US', 'ES']",
       "['CU', 'US']", "['BG', 'US']", "['US', 'BG']",
       "['US', 'DK', 'GB']", "['ES', 'IT']", "['TR', 'US']",
       "['PE', 'DE', 'NO']", "['LU', 'US', 'FR']",
       "['IL', 'MA', 'US', 'BG', 'GB']", "['AR', 'CL']",
       "['AR', 'ES', 'UY']", "['JP', 'CN']", "['US', 'AU']",
       "['QA', 'TN', 'FR']", "['ES', 'MX']", "['PH', 'SG']",
       "['US', 'AE']", "['DE', 'DK', 'NL', 'GB']", "['NL', 'MX']",
       "['CA', 'CN']", "['NO', 'SE', 'DK', 'NL']", "['US', 'DE', 'ZA']",
       "['IS', 'SE', 'BE']", "['DE', 'ES']", "['CN', 'FR', 'TW', 'US']",
       "['KH']", "['BE', 'FR', 'IT']", "['DE', 'CH']",
       "['JP', 'KR', 'FR']", "['DE', 'NZ', 'GB']", "['PE']",
       "['MX', 'US']", "['US', 'DK']", "['PL', 'US']", "['KE']", "['GH']",
       "['IT', 'CH', 'VA', 'FR', 'DE']", "['PE', 'GB', 'US', 'IL', 'IT']",
       "['SA', 'SY', 'AE']", "['US', 'KR']", "['IN', 'FR']",
       "['RS', 'PL', 'RU']", "['CL', 'NL', 'FR']", "['IE', 'CA']",
       "['US', 'NL']", "['TZ']", "['IT', 'ES']", "['ID', 'MY', 'SG']",
       "['FR', 'LU', 'CA']", "['FR', 'QA', 'TN', 'BE']",
       "['PL', 'CH', 'AL', 'IT']", "['CZ', 'US']", "['AR', 'FR']",
       "['DE', 'IT']", "['IT', 'FR']", "['MX', 'FI']", "['CA', 'BR']",
       "['IN', 'MX']", "['BR', 'DK', 'FR', 'DE', 'PL', 'AR']",
       "['ZA', 'US', 'CA']", "['ES', 'BE']", "['PY']", "['US', 'NG']",
       "['US', 'BE', 'GB']", "['ZW']", "['IT', 'AR']",
       "['AT', 'IQ', 'US']", "['GE']", "['AR', 'IT']", "['NG', 'NO']",
       "['IS', 'GB']", "['MX', 'CO']", "['AR', 'US']", "['KW']",
       "['JP', 'GB']", "['TW', 'US']", "['NP', 'IN']",
       "['AU', 'US', 'CN']", "['FR', 'IN', 'SG']", "['LB', 'PS']",
       "['JP', 'US', 'CA']", "['CM']", "['BD', 'IN']", "['CA', 'ZA']",
       "['FR', 'PS', 'CH', 'QA']", "['NL', 'JO', 'DE']",
       "['GB', 'DK', 'GR']", "['MX', 'AR']", "['US', 'CL', 'MX']",
       "['KG']", "['CH']", "['BD']", "['LU']", "['ZA', 'GB']",
       "['BT', 'CN']", "['CA', 'HU', 'US']", "['BE', 'LT', 'NL']",
       "['IT', 'MC', 'US', 'CA']", "['CN', 'US', 'AU', 'CA']",
       "['BE', 'SE', 'GB']", "['GB', 'CZ', 'FR']", "['US', 'MW', 'GB']",
       "['US', 'CY']", "['BE', 'FR', 'SN']", "['BR', 'FR', 'ES', 'BE']",
       "['US', 'CH']", "['US', 'IL']", "['FR', 'LT', 'GB']",
       "['GB', 'IE']", "['GB', 'IT']", "['JO', 'TH', 'US', 'AL']",
       "['PT', 'US']", "['IL', 'US', 'FR', 'DE']", "['TW', 'MY']",
       "['US', 'CA', 'FR', 'ES']", "['FI', 'NO']", "['US', 'FR', 'JP']",
       "['GB', 'JP']", "['US', 'CN', 'GB']",
       "['US', 'FR', 'SE', 'GB', 'DE', 'DK', 'CA']", "['DE', 'AT']",
       "['US', 'TH']", "['PH', 'US']", "['BR', 'MX']", "['NO', 'CA']",
       "['CO', 'ES']", "['CN', 'DE', 'GB']", "['NO', 'DE']",
       "['ES', 'PT']", "['IL', 'US']", "['ES', 'BE', 'DE']",
       "['TH', 'US']", "['US', 'FR', 'ES']", "['ES', 'FR', 'AR']",
       "['NL', 'PL', 'UA', 'GB', 'US']", "['QA', 'PS']",
       "['RS', 'UY', 'AR']", "['FR', 'IT']", "['CA', 'LK']",
       "['US', 'AR']", "['EG', 'US']", "['US', 'IN']",
       "['FR', 'LU', 'BE', 'KH']", "['US', 'BE', 'ES']",
       "['CA', 'FR', 'JP', 'GB', 'US']", "['AT', 'DE']",
       "['US', 'GB', 'DE']", "['FR', 'MX', 'CO']", "['BR', 'FR']",
       "['JO']", "['FR', 'IN', 'QA']", "['AR', 'PE']", "['MU']",
       "['DE', 'DK', 'EG']", "['US', 'IE']", "['IO']", "['TW', 'CN']",
       "['FR', 'NL', 'SG']", "['SN']", "['UY']", "['DE', 'IN', 'AT']",
       "['MA', 'FR', 'QA']", "['PS', 'PH']", "['EG', 'SA']",
       "['ES', 'CN']", "['CL', 'AR', 'CA']", "['AR', 'CO']",
       "['GT', 'UY']", "['AF', 'DE', 'PS']", "['ZA', 'AO']",
       "['HK', 'PH']", "['SG', 'MY']", "['SE', 'US']",
       "['LB', 'US', 'NL', 'CA']", "['NL', 'PS', 'US', 'LB']",
       "['DK', 'LB', 'GB']", "['UY', 'MX', 'ES']", "['PH', 'JP']",
       "['CN', 'JP', 'US']", "['NA']", "['LB', 'QA', 'SY', 'FR']",
       "['PS', 'DK', 'LB']", "['US', 'CZ']",
       "['GB', 'AU', 'CA', 'GR', 'NZ']", "['GR', 'GB', 'US']",
       "['DE', 'FR']", "['NL', 'US']", "['AT', 'GB', 'US']",
       "['CH', 'DE']", "['GB', 'US', 'DE']", "['DK', 'IS']",
       "['FR', 'DE', 'US']", "['US', 'JP', 'TH']", "['FR', 'DE']",
       "['RO', 'US']", "['ES', 'KN']", "['SE', 'GB']",
       "['SG', 'US', 'IN']", "['DE', 'AU']", "['GB', 'CA']",
       "['IE', 'US', 'CA']", "['PT']", "['US', 'PL', 'KR']",
       "['LU', 'FR']", "['IT', 'BR']", "['GB', 'HU', 'NL', 'CH']",
       "['BR', 'DE', 'QA', 'MX', 'US', 'CH', 'AR']", "['ES', 'PE']",
       "['BE', 'GB', 'DE']", "['ZA', 'GB', 'US']", "['CL', 'PE']",
       "['CA', 'CN', 'US']", "['SG', 'US']", "['BR', 'US']",
       "['BE', 'NL']", "['RU', 'US']", "['ES', 'US']", "['CZ', 'DE']",
       "['NZ', 'HK']", "['MA', 'SA', 'TN', 'EG', 'LB']", "['CN', 'GB']",
       "['AF']", "['BE', 'LU']", "['BE', 'DE']", "['SE', 'RO']",
       "['ZA', 'US']", "['GB', 'IN']", "['HU', 'CA']", "['NG', 'CA']",
       "['TZ', 'GB']", "['PH', 'FO']"], dtype=object)

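Note that each value of production_countries is a string that looks like a Python list, so unique() returns unique combinations of countries rather than unique countries. A minimal sketch of how the individual countries could be extracted, assuming every value is a valid Python list literal (the four sample values are copied from the output above, not read from titles.csv):

```python
import ast
import pandas as pd

# Sample of values written exactly as they appear in production_countries.
countries_raw = pd.Series(["['US']", "['GB', 'US']", "[]", "['LB', 'CA', 'FR']"])

# Each value is a string; literal_eval turns it into a real Python list.
countries_lists = countries_raw.apply(ast.literal_eval)

# explode() puts every country on its own row; empty lists become NaN and are dropped.
unique_countries = sorted(countries_lists.explode().dropna().unique())

print(unique_countries)
```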

2.8 NaN identification

NaN stands for "not a number". In pandas, NaN is used to represent a missing value.

We are going to calculate, for each variable, the proportion of missing values over the total number of observations:

Prop_NA = Netflix_Data.isnull().sum() / len(Netflix_Data)

Prop_NA
id                      0.000000
title                   0.000171
type                    0.000000
description             0.003077
release_year            0.000000
age_certification       0.447692
runtime                 0.000000
genres                  0.000000
production_countries    0.000000
seasons                 0.640000
imdb_id                 0.068889
imdb_score              0.082393
imdb_votes              0.085128
tmdb_popularity         0.015556
tmdb_score              0.053162
dtype: float64

We can see that some variables have a high proportion of missing values, such as age_certification (44.77%).

seasons is the variable with the highest proportion of missing values (64%), but this is because seasons is only defined for titles with type = SHOW.
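To check this kind of explanation, the proportion of missing values can be computed within each value of type. The sketch below uses a hypothetical toy data frame (not the real titles.csv), so the numbers are only illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: movies never have a value in seasons.
toy = pd.DataFrame({
    "type": ["SHOW", "MOVIE", "MOVIE", "SHOW", "MOVIE"],
    "seasons": [1.0, np.nan, np.nan, np.nan, np.nan],
})

# Overall proportion of missing values in seasons.
prop_na_overall = toy["seasons"].isnull().mean()

# Proportion of missing values in seasons within each value of type.
prop_na_by_type = toy["seasons"].isnull().groupby(toy["type"]).mean()

print(prop_na_overall)
print(prop_na_by_type)
```

If the missing values of seasons came only from movies, the proportion for SHOW in the real data would be 0.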


2.9 Variable Scaling

Scaling a variable means applying a transformation to it, so that the transformed variable has properties that the original variable doesn’t have.

In this article, we will focus on three scaling methods: standard scaling, normalization (0,1), and normalization (a,b).

In any case, there are more procedures that will not be explored here, so for a more extensive list, it is recommended to consult the sklearn documentation on this topic: https://scikit-learn.org/stable/modules/preprocessing.html

Some of the concepts that appear in this section, such as statistical variable, sample, mean and variance, are explained in more detail in the Statistical Description section.


2.9.1 Standard Scaling

Given a quantitative statistical variable \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\), and a sample \(\hspace{0.05cm}X_k \hspace{0.05cm} = \hspace{0.05cm} \left( \hspace{0.01cm} x_{1k} \hspace{0.01cm} , \hspace{0.01cm}x_{2k}\hspace{0.01cm},\dots ,\hspace{0.01cm} x_{nk} \hspace{0.01cm}\right)^t\hspace{0.05cm}\) of that statistical variable.

The standard scaling version of \(\hspace{0.1cm} X_k\hspace{0.1cm}\) is defined as: \(\\[0.25cm]\)

\[X_k^{std} \hspace{0.1cm} =\hspace{0.1cm} \dfrac{X_k - \overline{X}_k}{\sigma(X_k)} \\\]

Properties:

  • \(\hspace{0.1cm} \overline{X}_k^{\hspace{0.07cm}std} \hspace{0.1cm} =\hspace{0.1cm} 0 \\[0.8cm]\)

  • \(\hspace{0.2cm} \sigma( X_k^{\hspace{0.07cm}std} )^2 \hspace{0.1cm} =\hspace{0.1cm} 1 \\\)

Proof :

  • \(\overline{X}_k^{\hspace{0.07cm}std} \hspace{0.1cm} =\hspace{0.1cm} \overline{ \left( \dfrac{X_k - \overline{X}_k}{\sigma(X_k)} \right) } \hspace{0.1cm} = \hspace{0.1cm} \dfrac{1}{\sigma(X_k)} \cdot \left( \hspace{0.12cm} \overline{ \hspace{0.08cm} X_k - \overline{X}_k \hspace{0.08cm} } \hspace{0.12cm} \right) \hspace{0.1cm} = \hspace{0.1cm} \dfrac{1}{\sigma(X_k)} \cdot \left( \hspace{0.12cm} \overline{X}_k - \overline{ \hspace{0.08cm} \overline{X}_k \hspace{0.08cm} } \hspace{0.12cm} \right) \hspace{0.1cm} = \hspace{0.1cm} \dfrac{1}{\sigma(X_k)} \cdot \left( \hspace{0.08cm} \overline{X}_k - \overline{X}_k \hspace{0.08cm} \right) \hspace{0.1cm} =\hspace{0.1cm} \dfrac{1}{\sigma(X_k)} \hspace{0.07cm}\cdot \hspace{0.07cm} 0 \hspace{0.1cm}=\hspace{0.1cm} 0 \\[0.8cm]\)

  • \(\sigma\left( X_k^{\hspace{0.07cm}std} \right)^2 \hspace{0.1cm} =\hspace{0.1cm} \sigma\left( \dfrac{X_k - \overline{X}_k }{\sigma(X_k)} \right)^2 \hspace{0.1cm} =\hspace{0.1cm} \dfrac{1}{\sigma(X_k)^2} \cdot \sigma\left( \hspace{0.08cm} X_k - \overline{X}_k \hspace{0.08cm} \right)^2 \hspace{0.1cm} =\hspace{0.1cm} \dfrac{1}{\sigma(X_k)^2} \cdot \sigma( \hspace{0.08cm} X_k \hspace{0.08cm} )^2 \hspace{0.1cm}=\hspace{0.1cm} 1\)
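The two properties can be checked numerically. The following is a minimal sketch with NumPy, using the runtime values of the first five rows shown earlier as a small sample. It assumes the variance is computed with denominator n (ddof=0), as in the definition given in the Variance section.

```python
import numpy as np

# Small sample: runtime of the first five titles shown earlier.
x = np.array([51.0, 114.0, 109.0, 91.0, 150.0])

# Standard scaling: subtract the mean and divide by the standard deviation.
# ddof=0 means the variance uses the denominator n.
x_std = (x - x.mean()) / x.std(ddof=0)

print(x_std.mean())       # numerically very close to 0
print(x_std.var(ddof=0))  # numerically very close to 1
```

With pandas, note that Series.std() uses ddof=1 by default, so ddof=0 has to be passed explicitly to reproduce this definition.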


2.9.2 Normalization (0,1)

Given a sample \(\hspace{0.1cm} X_j=(x_{1j},...,x_{nj})^t\hspace{0.1cm}\) of a statistical variable.

The \(\hspace{0.1cm}(0,1)\hspace{0.1cm}\) normalized version of \(\hspace{0.1cm}X_j\hspace{0.1cm}\) is the following variable: \(\\[1cm]\)

\[X_j^{norm(0,1)} = \dfrac{X_j - Min(X_j)}{Max(X_j) - Min(X_j)} \\\]

Properties:

  • \(\hspace{0.2cm} Max \left(X_j^{norm(0,1)} \right) \hspace{0.1cm}=\hspace{0.1cm} 1 \\[0.8cm]\)

  • \(\hspace{0.2cm} Min \left( X_j^{norm(0,1)} \right) \hspace{0.1cm}=\hspace{0.1cm} 0 \\\)

Proof:

  • \(\hspace{0.1cm} Max \left( X_j^{norm(0,1)} \right) \hspace{0.1cm} = \hspace{0.1cm} \dfrac{ Max(X_j) - Min(X_j)}{Max(X_j) - Min(X_j)} \hspace{0.1cm}=\hspace{0.1cm} 1 \\[0.8cm]\)

  • \(\hspace{0.1cm} Min \left( X_j^{norm(0,1)} \right) \hspace{0.1cm}=\hspace{0.1cm} \dfrac{ Min(X_j) - Min(X_j)}{Max(X_j) - Min(X_j)} \hspace{0.1cm}=\hspace{0.1cm} 0\)
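A minimal NumPy sketch of the (0,1) normalization, on a small hypothetical sample. It assumes the sample is not constant, since otherwise the denominator Max - Min would be zero.

```python
import numpy as np

# Hypothetical sample of a quantitative variable (not constant).
x = np.array([51.0, 114.0, 109.0, 91.0, 150.0])

# (0,1) normalization: shift by the minimum and divide by Max - Min.
x_norm01 = (x - x.min()) / (x.max() - x.min())

print(x_norm01.min(), x_norm01.max())
```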


2.9.3 Normalization (a,b)

Given a sample \(\hspace{0.1cm}X_j=(x_{1j},...,x_{nj})^t\hspace{0.1cm}\) of a statistical variable, and two real numbers \(\hspace{0.1cm}a < b\).

The \(\hspace{0.1cm}(a,b)\hspace{0.1cm}\) normalized version of \(\hspace{0.1cm}X_j\hspace{0.1cm}\) is the following variable:

\[X_j^{norm(a,b)} = X_j^{norm(0,1)} \cdot (b - a) + a \\\]

Properties:

  • \(\hspace{0.2cm} Max \left(X_j^{norm(a,b)} \right) = b \\[0.8cm]\)

  • \(\hspace{0.2cm} Min \left( X_j^{norm(a,b)} \right) = a \\\)

Proof:

  • \(\hspace{0.1cm} Max \left(X_j^{norm(a,b)} \right) = Max \left(X_j^{norm(0,1)} \right)\cdot (b-a)+a= 1\cdot (b-a)+a = b \\[0.8cm]\)

  • \(\hspace{0.1cm} Min \left(X_j^{norm(a,b)} \right) = Min \left(X_j^{norm(0,1)} \right)\cdot (b-a)+a= 0\cdot (b-a)+a = a\)
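The same idea can be sketched in code. The helper name normalize_ab is hypothetical (it is not a pandas or NumPy function), and the sketch assumes a < b and a non-constant sample.

```python
import numpy as np

def normalize_ab(x, a, b):
    """Rescale the sample x to the interval [a, b] (assumes a < b, x not constant)."""
    x01 = (x - x.min()) / (x.max() - x.min())  # (0,1) normalization
    return x01 * (b - a) + a                   # stretch to length b - a, shift by a

# Hypothetical sample, rescaled to the interval [-1, 1].
x = np.array([51.0, 114.0, 109.0, 91.0, 150.0])
x_ab = normalize_ab(x, a=-1.0, b=1.0)

print(x_ab.min(), x_ab.max())
```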


2.10 Standard recoding of categorical variables

Given a sample \(\hspace{0.1cm}X_j=(x_{1j},...,x_{nj})^t\hspace{0.1cm}\) of a categorical statistical variable with \(\hspace{0.1cm}k\hspace{0.1cm}\) categories, whose range is \(\hspace{0.1cm}\Gamma( X_j) = \lbrace g_1, g_2 , ..., g_k \rbrace\).

Recoding \(\hspace{0.1cm}X_j\hspace{0.1cm}\) to standard format consists of obtaining a new variable \(\hspace{0.1cm}X_j^{recod}\hspace{0.1cm}\) defined as:

\[x_{ij}^{recod} = \left\lbrace\begin{array}{l} 0 \hspace{0.3cm} , \text{ if} \hspace{0.2cm} x_{ij} = g_1 \\ 1 \hspace{0.3cm} , \hspace{0.15cm} \text{if} \hspace{0.2cm} x_{ij} = g_2 \\ ... \\ k-1 \hspace{0.3cm} ,\text{ if} \hspace{0.2cm} x_{ij} = g_k \end{array}\right. \]

Remark:

\(\hspace{0.1cm}\Gamma( X_j^{recod}) = \lbrace 0,1,..., k-1 \rbrace\)
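One possible way to obtain such a recoding in pandas is through the integer codes of a Categorical. In this sketch the order of the categories (g_1 = 'MOVIE', g_2 = 'SHOW') is an arbitrary choice made for illustration, and the sample is hypothetical.

```python
import pandas as pd

# Hypothetical sample of the categorical variable type.
types = pd.Series(["SHOW", "MOVIE", "MOVIE", "SHOW"])

# Fix the order of the categories: g_1 = 'MOVIE', g_2 = 'SHOW'.
categories = ["MOVIE", "SHOW"]

# .codes maps g_1 -> 0, g_2 -> 1, ..., g_k -> k-1 (missing values would get -1).
recoded = pd.Categorical(types, categories=categories).codes

print(list(recoded))
```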


2.11 Categorization of quantitative variables

Given a sample \(\hspace{0.1cm} X_j=(x_{1j},...,x_{nj})^t\hspace{0.1cm}\) of a quantitative statistical variable.

The categorization of \(\hspace{0.1cm}X_j\hspace{0.1cm}\) consists of obtaining a new variable \(\hspace{0.1cm}X_j^{cat}\hspace{0.1cm}\) defined as:

\[x_{ij}^{cat} = \left\lbrace\begin{array}{l} 0 \hspace{0.3cm} , \text{ if} \hspace{0.2cm} x_{ij} \in [L_0 , L_1) \\ 1 \hspace{0.3cm} , \hspace{0.15cm} \text{if} \hspace{0.2cm} x_{ij} \in [L_1 , L_2) \\ ... \\ k-1 \hspace{0.3cm} ,\text{ if} \hspace{0.2cm} x_{ij} \in [L_{k-1} , L_k) \end{array}\right. \\[1cm] \]

Equivalently:

\[x_{ij}^{cat} = r \Leftrightarrow x_{ij} \in [L_r , L_{r+1}) \\\]

where:

\([L_0 , L_1), [L_1 , L_2), \dots , [L_{k-1} , L_k)\hspace{0.1cm}\) are called the categorization intervals of \(\hspace{0.1cm}X_j\hspace{0.1cm}\), and they are a series of intervals with the following properties:

  • They are pairwise disjoint, that is, they share no elements.

  • Every observation of the sample \(\hspace{0.1cm}X_j\hspace{0.1cm}\) belongs to some interval.

  • They have the same width.

As a consequence:

  • Every element of \(\hspace{0.1cm}X_j\hspace{0.1cm}\) belongs to exactly one interval. \(\\[1cm]\)

Remark:

\(X^{cat}_j\hspace{0.1cm}\) is a categorized version of the variable \(\hspace{0.1cm}X_j\) \(\\[1cm]\)
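A minimal sketch of this definition with pandas.cut(), using left-closed intervals [L_r, L_{r+1}). Both the sample and the interval limits L_0, ..., L_4 are hypothetical. The last limit is chosen strictly greater than the maximum, so that every observation falls in some interval.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of a quantitative variable.
x = np.array([7.0, 37.0, 51.0, 90.0, 109.0, 150.0])

# Hypothetical limits L_0 < L_1 < ... < L_4 defining k = 4 intervals of equal width.
limits = [0, 50, 100, 150, 200]

# right=False gives left-closed intervals [L_r, L_{r+1});
# labels=False returns the interval index r instead of an interval label.
x_cat = pd.cut(x, bins=limits, right=False, labels=False)

print(list(x_cat))
```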

How should the categorization intervals be defined?

There are different procedures for defining the categorization intervals; the most common ones are rules based on the quantiles of the variable under consideration.

Next, we present some quantile-based procedures and an alternative one, Scott's rule.


2.11.1 Mean rule

Following the mean rule, the categorization intervals of a quantitative variable \(\hspace{0.1cm} X_j\hspace{0.1cm}\) are the following:\(\\[0.8cm]\)

\[ \left[\hspace{0.1cm} Min(X_j) \hspace{0.1cm} ,\hspace{0.1cm} \overline{X}_j \hspace{0.1cm}\right] \hspace{0.1cm},\hspace{0.1cm} \left(\hspace{0.1cm} \overline{X}_j \hspace{0.1cm},\hspace{0.1cm} Max(X_j) \hspace{0.1cm}\right] \]


With the mean rule, the categorical version \(\hspace{0.1cm}X_j^{cat}\hspace{0.1cm}\) of the quantitative variable \(\hspace{0.1cm}X_j\hspace{0.1cm}\) is defined as: \(\\[0.8cm]\)

\[x_{ij}^{cat} = \left\lbrace\begin{array}{l} 0 \hspace{0.3cm} , \text{ if} \hspace{0.2cm} x_{ij} \in \left[\hspace{0.1cm}Min(X_j)\hspace{0.1cm},\hspace{0.1cm} \overline{X}_j\hspace{0.1cm}\right] \\ 1 \hspace{0.3cm} ,\text{ if} \hspace{0.2cm} x_{ij} \in \left(\hspace{0.1cm}\overline{X}_j \hspace{0.1cm},\hspace{0.1cm} Max(X_j)\hspace{0.1cm}\right] \end{array}\right. \\[1cm] \]
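A minimal sketch of the mean rule with NumPy, on a hypothetical sample. It follows the intervals [Min, mean] and (mean, Max], so an observation exactly equal to the mean is assigned to category 0.

```python
import numpy as np

# Hypothetical sample of a quantitative variable.
x = np.array([7.0, 37.0, 90.0, 100.0, 134.0])

# Arithmetic mean of the sample.
mean = x.mean()

# Category 0 for observations in [Min, mean], category 1 for those in (mean, Max].
x_cat = np.where(x <= mean, 0, 1)

print(mean)
print(list(x_cat))
```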


2.11.2 Median rule


2.11.3 Quartile rule


2.11.4 Scott's rule


2.12 Dummy encoding of categorical variables

2.13 Dealing with NaN


3 Statistical Description

3.1 Statistical variable

A statistical variable \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\) can be modeled as a random variable.

Under this approach, all the probability theory of random variables can be applied to statistical variables. \(\\[0.4cm]\)

3.2 Range of a statistical variable

The range of a statistical variable \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\) is denoted by \(\hspace{0.05cm}Range(\mathcal{X}_k)\hspace{0.05cm}\), and is defined as the set of possible values of \(\hspace{0.05cm}\mathcal{X}_k\). \(\\[0.4cm]\)

3.2.1 Statistical variable types: quantitative and categorical

  • The variable \(\mathcal{X}_k\) is quantitative if the elements of its range are conceptually numbers. \(\\[0.5cm]\)

  • The variable \(\mathcal{X}_k\) is categorical if the elements of its range are labels or categories (they can be numbers at a symbolic level but not at a conceptual level). \(\\[0.4cm]\)

3.2.2 Quantitative variable types: continuous and discrete

We can distinguish at least two types of quantitative variables: continuous and discrete.

  • \(\mathcal{X}_k\hspace{0.05cm}\) is continuous if \(\hspace{0.05cm}Range(\mathcal{X}_k)\hspace{0.05cm}\) is an uncountable set. \(\\[0.5cm]\)

  • \(\mathcal{X}_k\hspace{0.05cm}\) is discrete if \(\hspace{0.05cm}Range(\mathcal{X}_k)\hspace{0.05cm}\) is a countable set. \(\\[0.2cm]\)

Note:

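In particular, variables whose range is a finite set are discrete.

A non-finite range is not necessarily uncountable: a variable whose range is countably infinite (for example, the natural numbers) is still discrete under this definition. \(\\[0.4cm]\)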

3.2.3 Categorical variable types: r-ary

Let \(\mathcal{X}_k\) be a categorical variable.

  • \(\mathcal{X}_k\) is r-ary if its range has r elements, which are categories or labels.

In Statistics, binary (2-ary) categorical variables are particularly important. \(\\[0.4cm]\)

3.2.4 Categorical variable types: nominal and ordinal

Let \(\mathcal{X}_k\) be an \(r\)-ary categorical variable.

  • \(\mathcal{X}_k\) is nominal if there is no ordering between the \(r\) categories of its range. \(\\[0.4cm]\)

  • \(\mathcal{X}_k\) is ordinal if there is an ordering between the \(r\) categories of its range. \(\\[0.4cm]\)

3.3 Sample of a statistical variable

Given a statistical variable \(\hspace{0.05cm}\mathcal{X}_k\).

A sample of \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\) is a vector of values of \(\hspace{0.05cm}\mathcal{X}_k\), called observations.

Therefore:

\[ X_k \hspace{0.05cm} = \hspace{0.05cm} \begin{pmatrix} x_{1k} \\ x_{2k}\\ ... \\ x_{nk} \end{pmatrix} \hspace{0.05cm} = \hspace{0.05cm} \left( \hspace{0.01cm} x_{1k} \hspace{0.01cm} , \hspace{0.01cm}x_{2k}\hspace{0.01cm},\dots ,\hspace{0.01cm} x_{nk} \hspace{0.01cm}\right)^t \\ \]

is a sample of \(\hspace{0.05cm} \mathcal{X}_k \hspace{0.05cm}\), because it is a vector with the values or observations of the variable \(\hspace{0.05cm} \mathcal{X}_k \hspace{0.05cm}\) for the \(\hspace{0.05cm} n \hspace{0.05cm}\) elements or individuals of a sample.

Where: \(\hspace{0.1cm} x_{ik}\hspace{0.05cm}\) is the \(\hspace{0.05cm} i\)-th observation of the variable \(\hspace{0.05cm} \mathcal{X}_k\). \(\\[0.4cm]\)

3.4 Arithmetic Mean

Given a quantitative statistical variable \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\), and a sample \(\hspace{0.05cm}X_k \hspace{0.05cm} = \hspace{0.05cm} \left( \hspace{0.01cm} x_{1k} \hspace{0.01cm} , \hspace{0.01cm}x_{2k}\hspace{0.01cm},\dots ,\hspace{0.01cm} x_{nk} \hspace{0.01cm}\right)^t\hspace{0.05cm}\) of that statistical variable.

The arithmetic mean of \(\hspace{0.05cm}X_k \hspace{0.05cm}\) is defined as: \(\\[0.3cm]\)

\[\overline{\hspace{0.05cm} X_k \hspace{0.05cm} } \hspace{0.1cm}=\hspace{0.1cm} \dfrac{1}{n} \cdot \sum_{i=1}^n \hspace{0.05cm} x_{ik}\] \(\\[0.4cm]\)

Properties:

  • Existence: the arithmetic mean of a sample \(X_k\) of a statistical variable \(\mathcal{X}_k\) always exists, for any \(X_{k}\).

  • Commutativity: the arithmetic mean is not affected by the order of the elements of the sample \(X_k\).

  • \(\overline{X_k} + \overline{X_j} = \overline{X_k + X_j}\)

  • \(\overline{ a\cdot X_k + b} = a \cdot \overline{X_k} + b\) , for any \(a,b \in \mathbb{R}\)
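These properties can be checked numerically. The sketch below uses two small hypothetical samples of the same size and arbitrary constants a and b:

```python
import numpy as np

# Two hypothetical samples with the same number of observations.
x_k = np.array([8.2, 7.7, 8.2, 7.7, 6.8])
x_j = np.array([1.0, 4.0, 2.0, 5.0, 3.0])
a, b = 2.0, 3.0

# Arithmetic mean computed directly from the definition.
mean_k = x_k.sum() / len(x_k)

# Commutativity: reordering the sample does not change the mean.
mean_k_reversed = x_k[::-1].mean()

# Mean of a sum versus sum of the means.
mean_of_sum = (x_k + x_j).mean()
sum_of_means = x_k.mean() + x_j.mean()

# Mean of a linear transformation versus transformation of the mean.
mean_of_linear = (a * x_k + b).mean()
linear_of_mean = a * x_k.mean() + b

print(mean_k, mean_of_sum, mean_of_linear)
```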


3.5 Weighted Mean

Given a quantitative statistical variable \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\), and a sample \(\hspace{0.05cm}X_k \hspace{0.05cm} = \hspace{0.05cm} \left( \hspace{0.01cm} x_{1k} \hspace{0.01cm} , \hspace{0.01cm}x_{2k}\hspace{0.01cm},\dots ,\hspace{0.01cm} x_{nk} \hspace{0.01cm}\right)^t\hspace{0.05cm}\) of that statistical variable.

And given a vector of weights, one for each observation of the variable \(\hspace{0.05cm} \mathcal{X}_k\hspace{0.05cm}\): \(\hspace{0.1cm} w \hspace{0.05cm} = \hspace{0.05cm} (w_1,w_2,...,w_n)^t\)

The weighted mean of \(\hspace{0.05cm} X_k \hspace{0.05cm}\) with the weights vector \(\hspace{0.05cm} w \hspace{0.05cm}\) is defined as:

\[ \overline{X_k} (w) \hspace{0.1cm} = \hspace{0.1cm} \dfrac{1}{\hspace{0.1cm}\sum_{i=1}^{n} \hspace{0.05cm} w_{i} \hspace{0.1cm}} \hspace{0.05cm}\cdot\hspace{0.05cm} \sum_{i=1}^{n} \hspace{0.1cm} x_{ik} \cdot w_i \] \(\\[0.4cm]\)
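As an illustration, the sketch below computes a weighted mean of imdb_score using imdb_votes as weights, for three of the titles shown earlier. Using the number of votes as weights is only an assumption made for this example. The result from the definition is compared with numpy.average().

```python
import numpy as np

# imdb_score and imdb_votes of three titles shown earlier.
scores = np.array([8.2, 7.7, 6.8])
votes = np.array([808582.0, 107673.0, 45.0])

# Weighted mean from the definition: sum(x_i * w_i) / sum(w_i).
weighted_mean = (scores * votes).sum() / votes.sum()

# The same computation with NumPy's built-in function.
weighted_mean_np = np.average(scores, weights=votes)

print(weighted_mean, scores.mean())
```

Titles with many votes dominate the weighted mean, so it ends up closer to 8.2 than the unweighted mean does.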

3.6 Geometric Mean

Given the variable \(\hspace{0.05cm} X_k=(x_{1k}, x_{2k},...,x_{nk})^t\).

The geometric mean of the variable \(\hspace{0.05cm}X_k\hspace{0.05cm}\) is defined as: \(\\[0.3cm]\)

\[ \overline{X_k}_{geo} \hspace{0.05cm} = \hspace{0.05cm} \sqrt[n]{\prod_{i=1}^{n} x_{ik}} \hspace{0.05cm} = \hspace{0.05cm} \sqrt[n]{x_{1k}\cdot x_{2k}\cdot...\cdot x_{nk}} \] \(\\[0.4cm]\)
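A minimal sketch of the geometric mean, taken as the n-th root of the product of the n observations. It assumes all observations are strictly positive, and the sample is hypothetical. The second computation, through logarithms, is a common alternative that avoids overflow when the product is very large.

```python
import numpy as np

# Hypothetical sample of strictly positive values: the product is 8 and n = 3.
x = np.array([1.0, 2.0, 4.0])
n = len(x)

# Geometric mean as the n-th root of the product.
geo_mean = x.prod() ** (1 / n)

# Equivalent computation: exponential of the mean of the logarithms.
geo_mean_log = np.exp(np.log(x).mean())

print(geo_mean, geo_mean_log)
```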

3.7 Median

Given a statistical variable \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\), and a sample \(\hspace{0.05cm}X_k \hspace{0.05cm} = \hspace{0.05cm} \left( \hspace{0.01cm} x_{1k} \hspace{0.01cm} , \hspace{0.01cm}x_{2k}\hspace{0.01cm},\dots ,\hspace{0.01cm} x_{nk} \hspace{0.01cm}\right)^t\hspace{0.05cm}\) of that statistical variable.

The median of \(\hspace{0.05cm}X_k \hspace{0.05cm}\) is defined as a value \(Me(X_k)\) such that: \(\\[0.3cm]\)

\[\dfrac{1}{n} \cdot \sum_{i=1}^n \hspace{0.1cm} \mathbb{I} \hspace{0.05cm} \bigl[ \hspace{0.1cm} x_{ik} \hspace{0.05cm} \leq \hspace{0.05cm} Me(X_k) \hspace{0.1cm} \bigr] \hspace{0.1cm} = \hspace{0.1cm} 0.50\]

where: \(\hspace{0.15cm}\mathbb{I}\hspace{0.1cm}\) is the indicator function. \(\\[0.4cm]\)

Properties:

Existence: the median always exists for any set of numbers.

Invariance to permutations: the order of the numbers does not affect the median.

Non-additivity: the median of a sum of variables is not, in general, equal to the sum of their medians, that is, \(Me(X_k + X_r) \neq Me(X_k) + Me(X_r)\) in general.

Equivariance under affine transformations: multiplying all the numbers by a constant \(c\) and adding a constant \(b\) transforms the median in the same way, so that \(Me(c\cdot X_k) = c\cdot Me(X_k)\) and, more generally, \(Me(c\cdot X_k + b) = c\cdot Me(X_k) + b\).
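
The behaviour of the median under affine transformations and under sums can be checked on a toy example. This is only a sketch with made-up vectors `x` and `y`; `np.median` uses the usual convention of averaging the two central values when \(n\) is even.

```python
import numpy as np

# Hypothetical samples, chosen only for illustration.
x = np.array([1.0, 2.0, 10.0])
y = np.array([5.0, 0.0, 1.0])
c, b = 3.0, 4.0

# Affine transformation: the median is transformed in the same way.
median_affine = np.median(c * x + b)
affine_of_median = c * np.median(x) + b

# Sum of variables: the median of the sum differs from the sum of medians here.
median_of_sum = np.median(x + y)
sum_of_medians = np.median(x) + np.median(y)

print(median_affine, affine_of_median)
print(median_of_sum, sum_of_medians)
```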


3.8 Mode

Given a categorical variable \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\), and a sample \(\hspace{0.05cm}X_k \hspace{0.05cm} = \hspace{0.05cm} \left( \hspace{0.01cm} x_{1k} \hspace{0.01cm} , \hspace{0.01cm}x_{2k}\hspace{0.01cm},\dots ,\hspace{0.01cm} x_{nk} \hspace{0.01cm}\right)^t\hspace{0.05cm}\) of that statistical variable.

The mode of \(\hspace{0.05cm} X_k \hspace{0.05cm}\) is the most frequent value in \(\hspace{0.05cm}X_k \hspace{0.05cm} = \hspace{0.05cm} \left( \hspace{0.01cm} x_{1k} \hspace{0.01cm} , \hspace{0.01cm}x_{2k}\hspace{0.01cm},\dots ,\hspace{0.01cm} x_{nk} \hspace{0.01cm}\right)^t\hspace{0.05cm}\). If several values share the highest frequency, the sample has more than one mode. \(\\[0.4cm]\)
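
A short sketch of how the mode could be obtained with pandas follows. The Series `s` contains made-up age-certification labels; `Series.mode()` returns every value that attains the highest frequency, so the result may hold more than one element.

```python
import pandas as pd

# Hypothetical categorical sample.
s = pd.Series(["TV-MA", "R", "TV-MA", "PG", "R", "TV-MA"])

# All values that attain the maximum frequency.
modes = s.mode()

# Frequencies of each value, sorted from most to least frequent.
counts = s.value_counts()

print(modes.tolist())
print(counts)
```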

3.9 Variance

Given a quantitative variable \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\), and a sample \(\hspace{0.05cm}X_k \hspace{0.05cm} = \hspace{0.05cm} \left( \hspace{0.01cm} x_{1k} \hspace{0.01cm} , \hspace{0.01cm}x_{2k}\hspace{0.01cm},\dots ,\hspace{0.01cm} x_{nk} \hspace{0.01cm}\right)^t\hspace{0.05cm}\) of that statistical variable.

The variance of \(\hspace{0.05cm} X_k \hspace{0.05cm}\) is defined as:

\[\sigma(X_k)^2 \hspace{0.1cm} = \hspace{0.1cm} \dfrac{1}{n} \cdot \sum_{i=1}^n \hspace{0.05cm} \left(\hspace{0.05cm} x_{ik} - \overline{X_k} \hspace{0.05cm}\right)^2\]

The standard deviation of \(\hspace{0.05cm} X_k \hspace{0.05cm}\) is defined as the square root of the variance:

\[\sigma(X_k) \hspace{0.1cm} = \hspace{0.1cm} \sqrt{ \sigma(X_k)^2 } \hspace{0.1cm} = \hspace{0.1cm} \sqrt{ \dfrac{1}{n} \cdot \sum_{i=1}^n \left( \hspace{0.05cm} x_{ik} - \overline{X_k} \hspace{0.05cm} \right)^2 }\] \(\\[0.4cm]\)
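
The variance defined here divides by \(n\). When computing it in Python it is worth checking which divisor a library uses: `np.var` divides by \(n\) by default (`ddof=0`), while the pandas `var` and `std` methods divide by \(n-1\) by default (`ddof=1`). A minimal sketch with a made-up sample:

```python
import numpy as np
import pandas as pd

# Hypothetical sample.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Variance and standard deviation following the 1/n definition.
variance = np.mean((x - x.mean()) ** 2)
std_dev = np.sqrt(variance)

# NumPy default (ddof=0) matches the 1/n definition.
variance_numpy = np.var(x)

# pandas default (ddof=1) divides by n - 1; pass ddof=0 to match.
variance_pandas_default = pd.Series(x).var()
variance_pandas_ddof0 = pd.Series(x).var(ddof=0)

print(variance, std_dev, variance_numpy)
print(variance_pandas_default, variance_pandas_ddof0)
```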

3.10 Median Absolute Deviation

Given a statistical variable \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\), and a sample \(\hspace{0.05cm}X_k \hspace{0.05cm} = \hspace{0.05cm} \left( \hspace{0.01cm} x_{1k} \hspace{0.01cm} , \hspace{0.01cm}x_{2k}\hspace{0.01cm},\dots ,\hspace{0.01cm} x_{nk} \hspace{0.01cm}\right)^t\hspace{0.05cm}\) of that statistical variable.

The median absolute deviation (MAD) of \(\hspace{0.05cm} X_k \hspace{0.05cm}\) is defined as:

\[MAD(X_k) \hspace{0.1cm} = \hspace{0.1cm} Me \bigl( \hspace{0.1cm} \left| \hspace{0.05cm} X_k - Me(X_k) \hspace{0.05cm} \right| \hspace{0.1cm} \bigr) \hspace{0.1cm} = \hspace{0.1cm} Me \hspace{0.1cm} \Bigr[ \hspace{0.1cm} \left( \hspace{0.2cm} \left| \hspace{0.1cm} x_{ik} - Me(X_k) \hspace{0.1cm} \right| \hspace{0.15cm} : \hspace{0.15cm} i = 1,\dots,n \hspace{0.2cm} \right) \hspace{0.1cm} \Bigr]\] \(\\[0.4cm]\)
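
The MAD is a robust measure of spread: a single extreme value barely changes it. A small sketch, with a made-up sample that contains one very large observation:

```python
import numpy as np

# Hypothetical sample with one extreme value.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Median of the sample.
med = np.median(x)

# Median of the absolute deviations from the median.
mad = np.median(np.abs(x - med))

# For comparison, the standard deviation is dominated by the extreme value.
std_dev = np.std(x)

print(med, mad, std_dev)
```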

3.11 Quantiles

Given a statistical variable \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\), and a sample \(\hspace{0.05cm}X_k \hspace{0.05cm} = \hspace{0.05cm} \left( \hspace{0.01cm} x_{1k} \hspace{0.01cm} , \hspace{0.01cm}x_{2k}\hspace{0.01cm},\dots ,\hspace{0.01cm} x_{nk} \hspace{0.01cm}\right)^t\hspace{0.05cm}\) of that statistical variable.

The \(\hspace{0.05cm}q\)-order quantile of \(\hspace{0.05cm} X_k \hspace{0.05cm}\) is defined as a value \(Q(X_k , q)\) such that:

\[\dfrac{1}{n} \cdot \sum_{i=1}^n \hspace{0.1cm} \mathbb{I} \hspace{0.05cm} \bigl[ \hspace{0.1cm} x_{ik} \hspace{0.05cm} \leq \hspace{0.05cm} Q(\hspace{0.05cm} X_k \hspace{0.05cm},\hspace{0.05cm} q \hspace{0.05cm}) \hspace{0.1cm} \bigr] \hspace{0.1cm} = \hspace{0.1cm} q\]

where: \(\hspace{0.15cm}\mathbb{I}\hspace{0.1cm}\) is the indicator function. \(\\[0.3cm]\)

Observation:

The median is the 0.5-order quantile. \(\\[0.4cm]\)
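
In a finite sample there may be no value for which the proportion of observations at or below it is exactly \(q\), so software uses interpolation rules. The sketch below uses `np.quantile`, whose default method is linear interpolation; the sample `x` is made up.

```python
import numpy as np

# Hypothetical sample: the integers 1..10.
x = np.arange(1.0, 11.0)

# Quantiles of order 0.25, 0.50 and 0.75 (linear interpolation by default).
q25, q50, q75 = np.quantile(x, [0.25, 0.50, 0.75])

# Proportion of observations at or below each quantile.
frac_below_q50 = np.mean(x <= q50)
frac_below_q25 = np.mean(x <= q25)

print(q25, q50, q75)
print(frac_below_q50, frac_below_q25)
```

Here the proportion is exactly 0.5 for the median, but 0.3 rather than 0.25 for the first quartile, which shows that the defining equation holds only approximately in small samples.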

3.12 Kurtosis

Given a quantitative variable \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\), and a sample \(\hspace{0.05cm}X_k \hspace{0.05cm} = \hspace{0.05cm} \left( \hspace{0.01cm} x_{1k} \hspace{0.01cm} , \hspace{0.01cm}x_{2k}\hspace{0.01cm},\dots ,\hspace{0.01cm} x_{nk} \hspace{0.01cm}\right)^t\hspace{0.05cm}\) of that statistical variable.

The kurtosis coefficient of \(\hspace{0.05cm} X_k \hspace{0.05cm}\) is defined as: \(\\[0.35cm]\)

\[ \Psi(X_k) = \dfrac{\mu_{4}}{\sigma(X_k)^{4}} \]

where:

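\[ \mu_{4}\hspace{0.1cm} =\hspace{0.1cm} \frac{1}{n} \cdot \sum_{i=1}^{n} \hspace{0.05cm} \left( x_{ik} - \overline{X_k} \right)^4 \\[0.3cm] \]

is the fourth central moment of \(\hspace{0.05cm} X_k \hspace{0.05cm}\).

Properties: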

  • If \(\hspace{0.12cm}\Psi(X_k) \hspace{0.05cm} > \hspace{0.05cm} 3\hspace{0.08cm}\) \(\hspace{0.2cm}\Rightarrow\hspace{0.2cm}\) the distribution of \(\hspace{0.05cm} X_k \hspace{0.05cm}\) is more pointed and with longer tails than the normal distribution. \(\\[0.5cm]\)

  • If \(\hspace{0.12cm}\Psi(X_k) \hspace{0.05cm} < \hspace{0.05cm} 3\hspace{0.08cm}\) \(\hspace{0.2cm}\Rightarrow\hspace{0.2cm}\) the distribution of \(\hspace{0.05cm} X_k \hspace{0.05cm}\) is less pointed and with shorter tails than the normal distribution. \(\\[0.4cm]\)
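
The kurtosis coefficient is the fourth central moment (computed around the mean) divided by the fourth power of the standard deviation. A minimal NumPy sketch with a made-up sample:

```python
import numpy as np

# Hypothetical sample.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Deviations from the mean.
d = x - x.mean()

# Fourth central moment and 1/n variance.
mu4 = np.mean(d ** 4)
variance = np.mean(d ** 2)

# Kurtosis coefficient: mu4 / sigma^4 (reference value 3 for the normal distribution).
kurtosis = mu4 / variance ** 2

print(kurtosis)
```

Some libraries report the excess kurtosis (this value minus 3) by default, so it is worth checking the convention before comparing against the threshold 3.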

3.13 Skewness

Given a quantitative variable \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\), and a sample \(\hspace{0.05cm}X_k \hspace{0.05cm} = \hspace{0.05cm} \left( \hspace{0.01cm} x_{1k} \hspace{0.01cm} , \hspace{0.01cm}x_{2k}\hspace{0.01cm},\dots ,\hspace{0.01cm} x_{nk} \hspace{0.01cm}\right)^t\hspace{0.05cm}\) of that statistical variable.

The skewness coefficient of \(\hspace{0.05cm} X_k \hspace{0.05cm}\) is defined as: \(\\[0.25cm]\)

\[ \Gamma(X_k) = \dfrac{\mu_{3}}{\sigma(X_k)^{3}} \]

where:

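\[ \mu_{3}\hspace{0.1cm} =\hspace{0.1cm} \frac{1}{n} \cdot \sum_{i=1}^{n} \hspace{0.05cm} \left( x_{ik} - \overline{X_k} \right)^3 \\[0.3cm] \]

is the third central moment of \(\hspace{0.05cm} X_k \hspace{0.05cm}\).

Properties: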

Fisher’s skewness coefficient measures the degree of skewness in the distribution of a given statistical variable.

  • If \(\hspace{0.12cm} \Gamma(X_k) > 0\) \(\hspace{0.2cm} \Rightarrow \hspace{0.2cm}\) the distribution of \(X_k\) has skewness to the right. \(\\[0.6cm]\)

  • If \(\hspace{0.12cm} \Gamma(X_k) < 0\) \(\hspace{0.2cm} \Rightarrow \hspace{0.2cm}\) the distribution of \(X_k\) has skewness to the left. \(\\[0.4cm]\)
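
The skewness coefficient is the third central moment (computed around the mean) divided by the cube of the standard deviation. A minimal NumPy sketch with a made-up sample whose longer tail is on the right:

```python
import numpy as np

# Hypothetical sample with a longer right tail.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Deviations from the mean.
d = x - x.mean()

# Third central moment and 1/n standard deviation.
mu3 = np.mean(d ** 3)
sigma = np.sqrt(np.mean(d ** 2))

# Skewness coefficient: mu3 / sigma^3.
skewness = mu3 / sigma ** 3

print(skewness)
```

The positive result agrees with the right-skewed shape of the sample.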

3.14 Outliers

There are several definitions of outlier, but here we are going to consider the classic one.

Given a statistical variable \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\), and a sample \(\hspace{0.05cm}X_k \hspace{0.05cm} = \hspace{0.05cm} \left( \hspace{0.01cm} x_{1k} \hspace{0.01cm} , \hspace{0.01cm}x_{2k}\hspace{0.01cm},\dots ,\hspace{0.01cm} x_{nk} \hspace{0.01cm}\right)^t\hspace{0.05cm}\) of that statistical variable.

For any \(\hspace{0.05cm} i\in \lbrace 1,...,n \rbrace\) ,

The observation \(\hspace{0.05cm} x_{ik}\hspace{0.05cm}\) of \(\hspace{0.05cm} \mathcal{X}_k\hspace{0.05cm}\) is an outlier if and only if:

\[x_{ik} \hspace{0.05cm} >\hspace{0.05cm} Q(X_k \hspace{0.05cm} , \hspace{0.05cm} 0.75) + 1.5\cdot IQR(X_k) \hspace{0.5cm}\text{or}\hspace{0.5cm} x_{ik} \hspace{0.05cm} <\hspace{0.05cm} Q(X_k \hspace{0.05cm} , \hspace{0.05cm} 0.25) - 1.5\cdot IQR(X_k) \\\]

where: \(\hspace{0.25cm} IQR(X_k) \hspace{0.12cm} = \hspace{0.12cm} Q(X_k \hspace{0.05cm} , \hspace{0.05cm} 0.75) \hspace{0.08cm} - \hspace{0.08cm} Q(X_k \hspace{0.05cm} , \hspace{0.05cm} 0.25) \hspace{0.25cm}\) is the interquartile range of \(\hspace{0.05cm} X_k \hspace{0.05cm}\).
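
The rule above can be sketched in a few lines of NumPy. The sample `x` is made up, and the quartiles are computed with the default linear interpolation of `np.quantile`, which is one of several possible conventions.

```python
import numpy as np

# Hypothetical sample with one very large value.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])

# First and third quartiles and the interquartile range.
q1, q3 = np.quantile(x, [0.25, 0.75])
iqr = q3 - q1

# Lower and upper bounds of the classic 1.5 * IQR rule.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Observations outside the bounds are flagged as outliers.
is_outlier = (x < lower_bound) | (x > upper_bound)
outliers = x[is_outlier]

print(lower_bound, upper_bound, outliers)
```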


3.15 Data Matrix

Given \(\hspace{0.05cm} p \hspace{0.05cm}\) statistical variables \(\hspace{0.05cm}\mathcal{X}_1, \mathcal{X}_2, \dots \mathcal{X}_p\hspace{0.05cm}\), and given a sample \(\hspace{0.05cm}X_k = (x_{1k},...,x_{nk})^t\hspace{0.05cm}\) of \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\) for each \(\hspace{0.05cm}k \in \lbrace 1,...,p \rbrace\).

A data matrix of the variables \(\hspace{0.05cm}\mathcal{X}_1,...,\mathcal{X}_p\hspace{0.05cm}\) would be: \(\\[0.35cm]\)

\[ X \hspace{0.05cm}=\hspace{0.05cm} \left( X_1 , X_2,\dots , X_p \right) \hspace{0.05cm}=\hspace{0.05cm} \begin{pmatrix} x_{1}^{t} \\ x_{2} ^t \\ ... \\ x_{n} ^t \end{pmatrix} \hspace{0.05cm}=\hspace{0.05cm} \begin{pmatrix} x_{11} & x_{12}&...&x_{1p}\\ x_{21} & x_{22}&...&x_{2p}\\ ...&...&...&...\\ x_{n1}& x_{n2}&...&x_{np} \end{pmatrix} \\ \]

where:

\(x_i ^t \hspace{0.05cm}=\hspace{0.05cm} \left( x_{i1}, x_{i2}, \dots , x_{ip} \right)\hspace{0.1cm}\) is the vector with the values of the \(\hspace{0.05cm} p \hspace{0.05cm}\) statistical variables \(\hspace{0.05cm}\mathcal{X}_1,\dots ,\mathcal{X}_p\hspace{0.05cm}\) for the \(\hspace{0.05cm}i\)-th element of the sample, for \(\hspace{0.05cm} i \in \lbrace 1,...,n \rbrace\) \(\\[0.4cm]\)

Observations:

\(X \hspace{0.1cm}\) is a matrix with \(\hspace{0.05cm}n\hspace{0.05cm}\) rows and \(\hspace{0.05cm}p\hspace{0.05cm}\) columns, so it is a matrix of size \(\hspace{0.05cm} n\hspace{0.05cm} \text{x}\hspace{0.05cm}p\). \(\\[0.4cm]\)

3.16 Covariance

Given the statistical variables \(\hspace{0.05cm}\mathcal{X}_1, \mathcal{X}_2, \dots \mathcal{X}_p\hspace{0.05cm}\), and given a sample \(\hspace{0.05cm}X_k = (x_{1k},...,x_{nk})^t\hspace{0.05cm}\) of \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\) for each \(\hspace{0.05cm}k \in \lbrace 1,...,p \rbrace\).

The covariance between \(\hspace{0.05cm}X_k\hspace{0.05cm}\) and \(\hspace{0.05cm}X_r\hspace{0.05cm}\) is defined as:

\[ S(X_k, X_r) \hspace{0.1cm}=\hspace{0.1cm} \frac{1}{n} \cdot \sum_{i=1}^{n} \left(\hspace{0.05cm} x_{ik} - \overline{X_k} \hspace{0.05cm}\right)\cdot \left(\hspace{0.05cm} x_{ir} - \overline{X_r} \hspace{0.05cm}\right) \] \(\\[0.4cm]\)

3.16.1 Properties of covariance

  • \(S(X_k,X_r) \in (-\infty, \infty)\) \(\\[0.5cm]\)

  • \(S(X_k,X_r) \hspace{0.1cm} = \hspace{0.1cm} \dfrac{1}{n}\cdot \sum_{i=1}^{n} (x_{ik} \cdot x_{ir}) \hspace{0.05cm} - \hspace{0.05cm} \overline{X_k} \cdot \overline{X_r} \hspace{0.1cm} = \hspace{0.1cm} \overline{X_k\cdot X_r} \hspace{0.05cm} - \hspace{0.05cm} \overline{X_k} \cdot \overline{X_r}\) \(\\[0.5cm]\)

  • \(S(X_k, a + b\cdot X_r) \hspace{0.1cm} = \hspace{0.1cm} b\cdot S(X_k,X_r)\) \(\\[0.5cm]\)

  • \(S(X_k,X_r) \hspace{0.1cm} = \hspace{0.1cm} S(X_r,X_k)\) \(\\[0.5cm]\)

  • \(S(X_k,X_r)\hspace{0.05cm} >\hspace{0.05cm} 0 \hspace{0.2cm} \Rightarrow \hspace{0.2cm}\) Positive linear relationship between \(\hspace{0.05cm}X_k\hspace{0.05cm}\) and \(\hspace{0.05cm}X_r\hspace{0.05cm}\). \(\\[0.5cm]\)

  • \(S(X_k,X_r)\hspace{0.05cm} <\hspace{0.05cm} 0 \hspace{0.2cm} \Rightarrow \hspace{0.2cm}\) Negative linear relationship between \(\hspace{0.05cm}X_k\hspace{0.05cm}\) and \(\hspace{0.05cm}X_r\hspace{0.05cm}\). \(\\[0.5cm]\)

  • \(S(X_k,X_r) \hspace{0.05cm}=\hspace{0.05cm} 0 \hspace{0.2cm} \Rightarrow \hspace{0.2cm}\) There is no linear relationship between \(\hspace{0.05cm}X_k\hspace{0.05cm}\) and \(\hspace{0.05cm}X_r\hspace{0.05cm}\) (a non-linear relationship is still possible). \(\\[0.5cm]\)
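
Two of these properties can be illustrated numerically: the effect of an affine transformation, and the fact that a zero covariance only rules out a linear relationship. The sketch below uses made-up vectors; `cov_1n` is a hypothetical helper name that implements the \(1/n\) definition used in this article.

```python
import numpy as np

def cov_1n(a, b):
    """Covariance with divisor n, as in the definition above."""
    return np.mean((a - a.mean()) * (b - b.mean()))

# Hypothetical sample, symmetric around zero.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# z is an affine function of x, so S(x, z) = b * S(x, x) with b = 2.
z = 1.0 + 2.0 * x
cov_xz = cov_1n(x, z)

# y is a deterministic but non-linear function of x; the covariance is zero.
y = x ** 2
cov_xy = cov_1n(x, y)

# np.cov divides by n - 1 unless bias=True is passed.
cov_xz_numpy = np.cov(x, z, bias=True)[0, 1]

print(cov_xz, cov_xy, cov_xz_numpy)
```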

3.17 Covariance Matrix

The covariance matrix of a given data matrix \(\hspace{0.05cm}X \hspace{0.05cm}=\hspace{0.05cm} (X_1,...,X_p)\hspace{0.05cm}\) is: \(\\[0.2cm]\)

\[ S_X = \bigl( \hspace{0.2cm} s_{k,r} \hspace{0.05cm} : \hspace{0.05cm} k,r \in \lbrace 1,...,p \rbrace \hspace{0.2cm} \bigr) \]

where: \(\hspace{0.15cm} s_{k,r} = S(X_k , X_r)\) \(\\[0.25cm]\)

Matrix expression of the covariance matrix :

\[ S_X \hspace{0.1cm} = \hspace{0.1cm} \dfrac{1}{n} \cdot X\hspace{0.1cm}^t \cdot H \cdot X \]

where: \(\hspace{0.15cm} H \hspace{0.1cm}=\hspace{0.1cm} I_n \hspace{0.05cm} - \hspace{0.05cm} \dfrac{1}{n} \cdot 1_{nx1} \cdot 1^t_{nx1} \hspace{0.15cm}\) is the centering matrix.
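
The matrix expression can be verified against NumPy's own covariance routine. The following is a sketch with a small made-up data matrix `X` (rows are observations, columns are variables); `np.cov` needs `rowvar=False` for that layout and `bias=True` to divide by \(n\).

```python
import numpy as np

# Hypothetical data matrix with n = 4 observations and p = 2 variables.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 5.0],
              [4.0, 4.0]])
n = X.shape[0]

# Centering matrix H = I_n - (1/n) * 1 * 1^t.
H = np.eye(n) - np.ones((n, n)) / n

# Covariance matrix through the matrix expression.
S_matrix = X.T @ H @ X / n

# Reference computation with NumPy (divisor n, variables in columns).
S_numpy = np.cov(X, rowvar=False, bias=True)

print(S_matrix)
print(np.allclose(S_matrix, S_numpy))
```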

3.18 Correlation

Given the statistical variables \(\hspace{0.05cm}\mathcal{X}_1, \mathcal{X}_2, \dots \mathcal{X}_p\hspace{0.05cm}\), and given a sample \(\hspace{0.05cm}X_k = (x_{1k},...,x_{nk})^t\hspace{0.05cm}\) of \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\) for each \(\hspace{0.05cm}k \in \lbrace 1,...,p \rbrace\).

The Pearson linear correlation between the variables \(X_k\) and \(X_r\) is defined as:

\[ r(X_k,X_r) = \frac{S(X_k,X_r)}{\sigma(X_k) \cdot \sigma(X_r)} \] \(\\[0.25cm]\)

3.18.1 Properties of Pearson linear correlation

  • \(r(X_k,X_r) \in [-1,1]\) \(\\[0.5cm]\)

  • \(r(X_k,a + b\cdot X_r) = r(X_k,X_r)\) if \(b > 0\), and \(r(X_k,a + b\cdot X_r) = - r(X_k,X_r)\) if \(b < 0\). \(\\[0.5cm]\)

  • The sign of \(r(X_k,X_r)\) is equal to the sign of \(S(X_k,X_r)\) \(\\[0.5cm]\)

  • \(r(X_k,X_r) = \pm 1 \hspace{0.1cm} \Rightarrow \hspace{0.1cm}\) perfect linear relationship between \(X_k\) and \(X_r\). \(\\[0.5cm]\)

  • \(r(X_k,X_r) = 0 \hspace{0.1cm} \Rightarrow \hspace{0.1cm}\) There is no linear relationship between \(X_k\) and \(X_r\). \(\\[0.5cm]\)

  • \(r(X_k,X_r) \rightarrow \pm 1 \hspace{0.1cm} \Rightarrow \hspace{0.1cm}\) strong linear relationship between \(X_k\) and \(X_r\). \(\\[0.5cm]\)

  • \(r(X_k,X_r) \rightarrow 0 \hspace{0.1cm} \Rightarrow \hspace{0.1cm}\) weak linear relationship between \(X_k\) and \(X_r\). \(\\[0.5cm]\)

  • \(r(X_k,X_r) >0 \hspace{0.1cm} \Rightarrow \hspace{0.1cm}\) positive linear relationship between \(X_k\) and \(X_r\). \(\\[0.5cm]\)

  • \(r(X_k,X_r) <0 \hspace{0.1cm} \Rightarrow \hspace{0.1cm}\) negative linear relationship between \(X_k\) and \(X_r\). \(\\[0.5cm]\)
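
The Pearson correlation is the covariance divided by the product of the two standard deviations, and its behaviour under affine transformations can be checked numerically. In this sketch `pearson` is a hypothetical helper and the vectors are made up; note that a negative slope flips the sign of the coefficient.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation: covariance over the product of standard deviations."""
    cov = np.mean((a - a.mean()) * (b - b.mean()))
    return cov / (a.std() * b.std())  # ndarray.std divides by n by default

# Hypothetical samples.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

r_xy = pearson(x, y)

# Affine transformation with positive slope: correlation unchanged.
r_pos = pearson(x, 1.0 + 2.0 * y)

# Affine transformation with negative slope: correlation changes sign.
r_neg = pearson(x, 1.0 - 2.0 * y)

# Reference value from NumPy.
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_xy, r_pos, r_neg, r_numpy)
```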

3.19 Pearson Correlation Matrix

The Pearson correlation matrix of the data matrix \(X=(X_1 ,..., X_p)\) is : \(\\[0.25cm]\)

\[ R_X =\bigl( \hspace{0.12cm} r_{k,r} \hspace{0.12cm} : \hspace{0.12cm} k,r\in \lbrace 1,...,p \rbrace \hspace{0.12cm} \bigr) \] \(\\[0.25cm]\)

where: \(\hspace{0.2cm} r_{k,r} = r(X_k , X_r) \hspace{0.1cm}\) , for \(\hspace{0.12cm} k,r=1,...,p\) \(\\[0.35cm]\)

Matrix expression of the correlation matrix:

\[ R_X= D_s^{-1} \cdot S_X \cdot D_s^{-1} \]

where: \[ D_s \hspace{0.05cm} = \hspace{0.05cm} \text{diag} \left( \hspace{0.05cm} \sigma(X_1) ,..., \sigma(X_p) \hspace{0.05cm} \right) \] \(\\[0.5cm]\)
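
The matrix expression of the correlation matrix can be checked against `np.corrcoef`. This is a sketch with a small made-up data matrix `X` (rows are observations, columns are variables); it assumes no column is constant, since otherwise a standard deviation would be zero and \(D_s\) would not be invertible.

```python
import numpy as np

# Hypothetical data matrix with n = 4 observations and p = 2 variables.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 5.0],
              [4.0, 4.0]])

# Covariance matrix with divisor n.
S = np.cov(X, rowvar=False, bias=True)

# Inverse of the diagonal matrix of standard deviations.
D_inv = np.diag(1.0 / np.sqrt(np.diag(S)))

# Correlation matrix through the matrix expression.
R_matrix = D_inv @ S @ D_inv

# Reference computation with NumPy.
R_numpy = np.corrcoef(X, rowvar=False)

print(R_matrix)
print(np.allclose(R_matrix, R_numpy))
```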

3.20 Absolute Frequency

Given a statistical variable \(\hspace{0.07cm}\mathcal{X}_k\hspace{0.05cm}\), and a sample \(\hspace{0.05cm}X_k = (x_{1k},...,x_{nk})^t\hspace{0.07cm}\) of \(\hspace{0.07cm}\mathcal{X}_k\hspace{0.03cm}\).

3.20.1 Absolute Frequency of an element

Given \(\hspace{0.07cm} b \in Range(\mathcal{X}_k)\).

The absolute frequency of the element \(\hspace{0.07cm}b\hspace{0.07cm}\) in \(\hspace{0.07cm}X_k\hspace{0.07cm}\) is defined as :

\[ F_A(b ,X_k) \hspace{0.1cm}=\hspace{0.1cm} \# \hspace{0.05cm} \Bigl\{ \hspace{0.1cm} i \in \lbrace 1,... , n \rbrace \hspace{0.1cm} : \hspace{0.1cm} x_{ik}=b \hspace{0.1cm} \Bigr\} \]

Observation:

If \(\hspace{0.05cm}\) \(\mathcal{X}_k\) \(\hspace{0.05cm}\) is continuous, usually \(\hspace{0.05cm}\) \(F_A(b , X_k) = 0\) \(\hspace{0.05cm}\) for many values \(\hspace{0.05cm}\) \(b\) \(\\[0.4cm]\)

3.20.2 Absolute frequency of a set

Given \(\hspace{0.05cm}B \subset Range(\mathcal{X}_k)\)

The absolute frequency of the set \(\hspace{0.05cm}B\hspace{0.05cm}\) in \(\hspace{0.05cm}X_k\hspace{0.05cm}\) is defined as:

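\[ F_A(B, X_k) = \sum_{b \in B} F_A(b , X_k ) = \# \hspace{0.05cm} \Bigl\{ \hspace{0.1cm} i \in \lbrace 1,... , n \rbrace \hspace{0.1cm} : \hspace{0.1cm} x_{ik} \in B \hspace{0.1cm} \Bigr\} \]

Observation:

\(F_A([c_1,c_2], X_k)\) \(\hspace{0.08cm}\) is a particular case of \(\hspace{0.08cm}\) \(F_A(B, X_k)\) \(\hspace{0.08cm}\) with \(\hspace{0.08cm}\) \(B=[c_1,c_2]\) \(\\[0.4cm]\)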

3.21 Relative Frequency

Given a statistical variable \(\hspace{0.07cm}\mathcal{X}_k\hspace{0.05cm}\), and a sample \(\hspace{0.05cm}X_k = (x_{1k},...,x_{nk})^t\hspace{0.07cm}\) of \(\hspace{0.07cm}\mathcal{X}_k\hspace{0.03cm}\).

3.21.1 Relative frequency of an element

Given \(\hspace{0.07cm}b \in Range(\mathcal{X}_k)\)

The relative frequency of the element \(\hspace{0.07cm}b\hspace{0.07cm}\) in \(\hspace{0.07cm}X_k\hspace{0.07cm}\) is defined as :

\[ F_{Re}(b,X_k) \hspace{0.07cm}=\hspace{0.07cm} \dfrac{F_A(b,X_k) }{n} \] \(\\[0.4cm]\)

3.21.2 Relative frequency of a set

Given \(\hspace{0.07cm}B \subset Range(\mathcal{X}_k)\).

The relative frequency of the set \(\hspace{0.07cm}B\hspace{0.07cm}\) in \(\hspace{0.07cm}X_k\hspace{0.07cm}\) is defined as:

\[ F_{Re}(B,X_k) \hspace{0.07cm}=\hspace{0.07cm} \dfrac{F_A(B ,X_k) }{n} \] \(\\[0.4cm]\)

3.22 Cumulative Absolute Frequency

The cumulative absolute frequency of the element \(b\) in \(X_k\) is defined as:

\[ F_{CumA}(b ,X_k) \hspace{0.07cm}= \hspace{0.07cm} F_A \left( (-\infty , b \hspace{0.03cm}] , X_k \right) \hspace{0.07cm}= \hspace{0.07cm} \# \hspace{0.05cm} \lbrace i=1,...,n \hspace{0.07cm} : \hspace{0.07cm} x_{ik} \leq b \rbrace \] \(\\[0.4cm]\)

3.23 Cumulative Relative Frequency

The cumulative relative frequency of the element \(b\) in \(X_k\) is defined as:

\[ F_{CumRe}(b,X_k)= \dfrac{F_{CumA}(b,X_k)}{n} \] \(\\[0.4cm]\)

3.24 Frequency Table

A frequency table is a table that contains the absolute, relative and cumulative frequencies of a statistical variable.
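
A frequency table of this kind could be built with pandas as sketched below. The Series `s` holds made-up values for a `seasons` variable; the column names of the resulting table are arbitrary choices for this example. The cumulative columns are only meaningful when the values can be ordered.

```python
import pandas as pd

# Hypothetical sample of a discrete quantitative variable.
s = pd.Series([1, 1, 2, 1, 3, 2], name="seasons")

# Absolute frequencies, sorted by the value of the variable.
abs_freq = s.value_counts().sort_index()

# Frequency table with absolute, relative and cumulative frequencies.
freq_table = pd.DataFrame({
    "abs_freq": abs_freq,
    "rel_freq": abs_freq / len(s),
})
freq_table["cum_abs_freq"] = freq_table["abs_freq"].cumsum()
freq_table["cum_rel_freq"] = freq_table["rel_freq"].cumsum()

print(freq_table)
```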

\(\\[0.4cm]\)

4 Statistical Description Protocol for Quantitative Variables

mean, median, variance, quantiles, kurtosis, skewness, outliers

frequency tables –> https://www.statology.org/frequency-tables-python/

5 Statistical Description Protocol for Categorical Variables

mode, quantiles

frequency tables

6 Statistical Description Protocol for Variable Crossings (quantitative-categorical, categorical-categorical, quantitative-quantitative)

quantitative-categorical –> mean, median, variance, quantiles, etc., computed by groups. Joint and conditional frequency tables.

categorical-categorical –> Joint and conditional frequency tables.

quantitative-quantitative –> transform to categorical-categorical case.
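
The three kinds of crossings could be sketched with pandas as follows. The data frame `df` is a small made-up example that only borrows column names from the Netflix data-set, and the bin edges and labels used to discretize the quantitative variables are arbitrary assumptions for this illustration.

```python
import pandas as pd

# Hypothetical data, reusing some column names of the data-set.
df = pd.DataFrame({
    "type": ["MOVIE", "SHOW", "MOVIE", "SHOW", "MOVIE", "MOVIE"],
    "age_certification": ["R", "TV-MA", "PG", "TV-MA", "R", "PG"],
    "runtime": [114, 51, 91, 45, 109, 100],
    "imdb_score": [8.2, 7.0, 8.2, 6.5, 7.7, 7.7],
})

# Quantitative-categorical: statistics of runtime by groups of type.
by_group = df.groupby("type")["runtime"].agg(["mean", "median", "var"])

# Categorical-categorical: joint and conditional relative frequencies.
joint = pd.crosstab(df["type"], df["age_certification"], normalize="all")
conditional = pd.crosstab(df["type"], df["age_certification"], normalize="index")

# Quantitative-quantitative: discretize both variables, then cross them.
runtime_bin = pd.cut(df["runtime"], bins=[0, 60, 120], labels=["short", "long"])
score_bin = pd.cut(df["imdb_score"], bins=[0, 7.5, 10], labels=["low", "high"])
binned = pd.crosstab(runtime_bin, score_bin)

print(by_group)
print(joint)
print(conditional)
print(binned)
```

In `joint` all cells add up to 1, while in `conditional` each row adds up to 1, since it gives the distribution of `age_certification` within each `type`.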

7 Statistical visualization

7.1 Visualization Protocol for Quantitative Variables

7.2 Visualization Protocol for Categorical Variables

7.3 Visualization Protocol for Quantitative-Categorical

7.4 Visualization Protocol for Categorical-Categorical




8 Basic Statistical Description

Next we carry out a basic statistical description of the variables, using several basic statistics.

8.1 Basic statistics for the quantitative variables

For the quantitative variables:

Netflix_Data.describe()
release_year runtime seasons imdb_score imdb_votes tmdb_popularity tmdb_score
count 5850.000000 5850.000000 2106.000000 5368.000000 5.352000e+03 5759.000000 5539.000000
mean 2016.417094 76.888889 2.162868 6.510861 2.343938e+04 22.637925 6.829175
std 6.937726 39.002509 2.689041 1.163826 9.582047e+04 81.680263 1.170391
min 1945.000000 0.000000 1.000000 1.500000 5.000000e+00 0.009442 0.500000
25% 2016.000000 44.000000 1.000000 5.800000 5.167500e+02 2.728500 6.100000
50% 2018.000000 83.000000 1.000000 6.600000 2.233500e+03 6.821000 6.900000
75% 2020.000000 104.000000 2.000000 7.300000 9.494000e+03 16.590000 7.537500
max 2022.000000 240.000000 42.000000 9.600000 2.294231e+06 2274.044000 10.000000


8.2 Basic statistics for the categorical variables

For the categorical variables (non-quantitative, in general):

Netflix_Data.loc[: , ['title', 'description', 'age_certification', 'genres', 'production_countries' ]].describe()
title description age_certification genres production_countries
count 5849 5832 3231 5850 5850
unique 5798 5829 11 1726 452
top The Gift Five families struggle with the ups and downs … TV-MA [‘comedy’] [‘US’]
freq 3 2 883 484 1959


8.3 Joint plots for the quantitative variables

In this section we carry out a basic graphical analysis of the quantitative variables, considered jointly.

We load the libraries needed for the plots:

import numpy as np

import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt

8.3.1 Joint histogram of the quantitative variables

We generate a figure with one histogram for each of the quantitative variables:

fig, axs = plt.subplots(3, 3, figsize=(11, 11))

p1 = sns.histplot(data=Netflix_Data, x="release_year", stat="proportion", bins=15, color="skyblue", ax=axs[0, 0])
 

p2 = sns.histplot(data=Netflix_Data, x="runtime", stat="proportion", bins=15, color="olive", ax=axs[0, 1])
p2.axes.set(xlabel='runtime', ylabel=' ')
 

p3 = sns.histplot(data=Netflix_Data, x="seasons", stat="proportion", bins=15, color="blue", ax=axs[0, 2])
p3.axes.set(xlabel='seasons', ylabel=' ')
 

p4 = sns.histplot(data=Netflix_Data, x="imdb_score", stat="proportion", bins=15, color="teal", ax=axs[1, 0])
p4.axes.set(xlabel='imdb_score', ylabel=' ')
 

p5 = sns.histplot(data=Netflix_Data, x="imdb_votes", stat="proportion", bins=15, color="purple", ax=axs[1, 1])
p5.axes.set(xlabel='imdb_votes', ylabel=' ')
 

p6 = sns.histplot(data=Netflix_Data, x="tmdb_popularity", stat="proportion", bins=15, color="pink", ax=axs[1, 2])
p6.axes.set(xlabel='tmdb_popularity', ylabel=' ')
 
 
p7 = sns.histplot(data=Netflix_Data, x="tmdb_score", stat="proportion", bins=15, color="red", ax=axs[2, 0])
p7.axes.set(xlabel='tmdb_score', ylabel=' ')
 
fig.savefig('p1.png', format='png', dpi=1200)

plt.show()


Joint histogram of the quantitative variables


8.3.2 Joint box-plot of the quantitative variables

We generate a figure with one box-plot for each of the quantitative variables:

fig, axs = plt.subplots(3, 3, figsize=(11, 11))

# The y-axis of a single horizontal box-plot carries no scale, so only the x-ticks are set.
p1 = sns.boxplot(data=Netflix_Data, x="release_year", color="skyblue", ax=axs[0, 0])

p2 = sns.boxplot(data=Netflix_Data, x="runtime",  color="olive", ax=axs[0, 1])
p2.axes.set(xlabel='runtime', ylabel=' ')
p2.set_xticks( range(int(Netflix_Data['runtime'].min()) , int(Netflix_Data['runtime'].max()) , 100) )

p3 = sns.boxplot(data=Netflix_Data, x="seasons", color="blue", ax=axs[0, 2])
p3.axes.set(xlabel='seasons', ylabel=' ')

p4 = sns.boxplot(data=Netflix_Data, x="imdb_score", color="teal", ax=axs[1, 0])
p4.axes.set(xlabel='imdb_score', ylabel=' ')
# One tick per integer score.
p4.set_xticks( range(int(Netflix_Data['imdb_score'].min()) , int(Netflix_Data['imdb_score'].max()) + 1 , 1) )

p5 = sns.boxplot(data=Netflix_Data, x="imdb_votes", color="purple", ax=axs[1, 1])
p5.axes.set(xlabel='imdb_votes', ylabel=' ')
p5.set_xticks( range(int(Netflix_Data['imdb_votes'].min()) , int(Netflix_Data['imdb_votes'].max()/2) , 500000) )

p6 = sns.boxplot(data=Netflix_Data, x="tmdb_popularity", color="pink", ax=axs[1, 2])
p6.axes.set(xlabel='tmdb_popularity', ylabel=' ')
p6.set_xticks( range(int(Netflix_Data['tmdb_popularity'].min()) , int(Netflix_Data['tmdb_popularity'].max()+1) , 1000) )

p7 = sns.boxplot(data=Netflix_Data, x="tmdb_score", color="red", ax=axs[2, 0])
p7.axes.set(xlabel='tmdb_score', ylabel=' ')
p7.set_xticks( range(int(Netflix_Data['tmdb_score'].min()) , int(Netflix_Data['tmdb_score'].max()+1) , 2) )

plt.show()


Joint box-plot of the quantitative variables


8.3.3 Joint empirical cumulative distribution function plot (ECDF-plot) of the quantitative variables

We generate a figure with one ECDF-plot for each of the quantitative variables:

fig, axs = plt.subplots(3, 3, figsize=(11, 11))

p1 = sns.ecdfplot(data=Netflix_Data, x="release_year", color="skyblue", ax=axs[0, 0])
p1.set_xticks( range(int(Netflix_Data['release_year'].min()) , int(Netflix_Data['release_year'].max()+20) , 20) )
p1.set_yticks( np.arange(0, 1, 0.1)  )

p2 = sns.ecdfplot(data=Netflix_Data, x="runtime",  color="olive", ax=axs[0, 1])
p2.axes.set(xlabel='runtime', ylabel=' ')
p2.set_xticks( range(int(Netflix_Data['runtime'].min()) , int(Netflix_Data['runtime'].max()) , 100) )
p2.set_yticks( np.arange(0, 1, 0.1)  )

p3 = sns.ecdfplot(data=Netflix_Data, x="seasons", color="blue", ax=axs[0, 2])
p3.axes.set(xlabel='seasons', ylabel=' ')
p3.set_xticks( range(int(Netflix_Data['seasons'].min()) , int(Netflix_Data['seasons'].max()) , 4) )
p3.set_yticks( np.arange(0, 1, 0.1)  )

p4 = sns.ecdfplot(data=Netflix_Data, x="imdb_score", color="teal", ax=axs[1, 0])
p4.axes.set(xlabel='imdb_score', ylabel=' ')
p4.set_xticks( range(int(Netflix_Data['imdb_score'].min()) , int(Netflix_Data['imdb_score'].max()) + 1 , 1) )
p4.set_yticks( np.arange(0, 1, 0.1)  )

p5 = sns.ecdfplot(data=Netflix_Data, x="imdb_votes", color="purple", ax=axs[1, 1])
p5.axes.set(xlabel='imdb_votes', ylabel=' ')
p5.set_xticks( range(int(Netflix_Data['imdb_votes'].min()) , int(Netflix_Data['imdb_votes'].max()/2) , 500000) )
p5.set_yticks( np.arange(0, 1, 0.1)  )

p6 = sns.ecdfplot(data=Netflix_Data, x="tmdb_popularity", color="pink", ax=axs[1, 2])
p6.axes.set(xlabel='tmdb_popularity', ylabel=' ')
p6.set_xticks( range(int(Netflix_Data['tmdb_popularity'].min()) , int(Netflix_Data['tmdb_popularity'].max()+1) , 1000) )
p6.set_yticks( np.arange(0, 1, 0.1)  )
 
p7 = sns.ecdfplot(data=Netflix_Data, x="tmdb_score", color="red", ax=axs[2, 0])
p7.axes.set(xlabel='tmdb_score', ylabel=' ')
p7.set_xticks( range(int(Netflix_Data['tmdb_score'].min()) , int(Netflix_Data['tmdb_score'].max()+1) , 2) )
p7.set_yticks( np.arange(0, 1, 0.1)  )

plt.show()


Joint ECDF-plot of the quantitative variables


8.4 Joint plots for the categorical variables

8.4.1 Joint bar-plot of the categorical variables

We generate a figure with one bar-plot for each of the categorical variables, except for those with too many categories for the plot to be practical:

fig, axs = plt.subplots(1, 2, figsize=(13, 6))

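# Fix the order of the categories explicitly so that the tick labels match the bars.
p1 = sns.countplot(x='type', data=Netflix_Data, order=['MOVIE', 'SHOW'], ax=axs[0])
p1.set_xticks([0, 1])
p1.set_xticklabels(['Movie', 'Show'])
p1.axes.set(xlabel='type', ylabel='count')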

p2 = sns.countplot(x='age_certification', data=Netflix_Data, ax=axs[1]) 

plt.show()


Joint bar-plot of the categorical variables


9 Statistical Analysis

The previous section gave a basic statistical description of the variables of the data-set we are working with, but the results obtained were not analyzed.

In this section, besides extending the statistical description of the data, we analyze the results obtained.